{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "stopped-scale",
   "metadata": {},
   "source": [
    "# Week 2: Diving deeper into the BBC News archive\n",
    "\n",
    "Welcome! In this assignment you will be revisiting the [BBC News Classification Dataset](https://www.kaggle.com/c/learn-ai-bbc/overview), which contains 2225 examples of news articles with their respective labels. \n",
    "\n",
    "This time you will not only work with the tokenization process, but you will also create a classifier using specialized layers for text data such as Embedding and GlobalAveragePooling1D.\n",
    "\n",
    "#### TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:\n",
    "\n",
    "- All cells are frozen except for the ones where you need to submit your solutions or when explicitly mentioned you can interact with it.\n",
    "\n",
    "- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.\n",
    "\n",
    "- You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!\n",
    "\n",
    "- Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.\n",
    "\n",
    "- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "magnetic-rebate",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "gnwiOnGyW5JK",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "import io\n",
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0aff8f4",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "import unittests"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "lightweight-cambridge",
   "metadata": {},
   "source": [
    "For this assignment the data comes from a csv. You can find the file `bbc-text.csv` under the `./data` folder. \n",
    "Run the next cell to take a peek into the structure of the data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "floppy-stuff",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open(\"data/bbc-text.csv\", 'r') as csvfile:\n",
    "    print(f\"First line (header) looks like this:\\n\\n{csvfile.readline()}\")\n",
    "    print(f\"The second line (first data point) looks like this:\\n\\n{csvfile.readline()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bizarre-veteran",
   "metadata": {},
   "source": [
    "As you can see, each data point is composed of the category of the news article followed by a comma and then the actual text of the article. The comma here is used to delimit columns. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "balanced-settle",
   "metadata": {},
   "source": [
    "## Defining useful global variables\n",
    "Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises:\n",
    "\n",
    "- `VOCAB_SIZE`: The maximum number of words to keep, based on word frequency. Defaults to 1000.\n",
    "  \n",
    "- `EMBEDDING_DIM`: Dimension of the dense embedding, will be used in the embedding layer of the model. Defaults to 16.\n",
    "  \n",
    "- `MAX_LENGTH`: Maximum length of all sequences. Defaults to 120.\n",
    "  \n",
    "- `TRAINING_SPLIT`: Proportion of data used for training. Defaults to 0.8\n",
    "  \n",
    "**A note about grading:**\n",
    "\n",
    "**When you submit this assignment for grading these same values for these globals will be used so make sure that all your code works well with these values. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have in the classification process. Since this next cell is frozen, you will need to copy the contents into a new cell and run it to overwrite the values for these globals.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "quantitative-mauritius",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "VOCAB_SIZE = 1000\n",
    "EMBEDDING_DIM = 16\n",
    "MAX_LENGTH = 120\n",
    "TRAINING_SPLIT = 0.8"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "synthetic-beijing",
   "metadata": {},
   "source": [
    "## Loading and pre-processing the data\n",
    "\n",
    "Go ahead and open the data by running the cell below. While there are many ways in which you can do this, this implementation takes advantage of the Numpy function [`loadtxt`](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) to load the data. Since the file is saved in a csv format, you need to set the parameter `delimiter=','`, otherwise the function splits at whitespaces by default. Also, you need to set ` dtype='str'` to indicate that the expected content type is a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "flying-lincoln",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "data_dir = \"data/bbc-text.csv\"\n",
    "data = np.loadtxt(data_dir, delimiter=',', skiprows=1, dtype='str', comments=None)\n",
    "print(f\"Shape of the data: {data.shape}\")\n",
    "print(f\"{data[0]}\\n{data[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2d11671-76a9-4699-a7e7-1463a8f5890d",
   "metadata": {},
   "source": [
    "As expected, you get a Numpy array with shape `(2225, 2)`. This means that you have 2225 rows, and 2 columns. As seen in the output of the previous cell, the first column corresponds to labels, and the second one corresponds to texts. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "sublime-maine",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Test the function\n",
    "print(f\"There are {len(data)} sentence-label pairs in the dataset.\\n\")\n",
    "print(f\"First sentence has {len((data[0,1]).split())} words.\\n\")\n",
    "print(f\"The first 5 labels are {data[:5,0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "consecutive-battle",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "There are 2225 sentence-label pairs in the dataset.\n",
    "\n",
    "First sentence has 737 words.\n",
    "\n",
    "The first 5 labels are ['tech' 'business' 'sport' 'sport' 'entertainment']\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "polished-eagle",
   "metadata": {},
   "source": [
    "## Training - Validation Datasets\n",
    "\n",
    "### Exercise 1: train_val_datasets\n",
    "Now you will code the `train_val_datasets` function, which, given the `data` DataFrame, should return the training and validation datasets, consisting of `(text, label)` pairs. For this last part, you will be using the [tf.data.Dataset.from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "small-violence",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTIONS: train_val_datasets\n",
    "def train_val_datasets(data):\n",
    "    '''\n",
    "    Splits data into traning and validations sets\n",
    "    \n",
    "    Args:\n",
    "        data (np.array): array with two columns, first one is the label, the second is the text\n",
    "    \n",
    "    Returns:\n",
    "        (tf.data.Dataset, tf.data.Dataset): tuple containing the train and validation datasets\n",
    "    '''\n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    # Compute the number of sentences that will be used for training (should be an integer)\n",
    "    train_size = None\n",
    "\n",
    "    # Slice the dataset to get only the texts. Remember that texts are on the second column\n",
    "    texts = None\n",
    "    # Slice the dataset to get only the labels. Remember that labels are on the first column\n",
    "    labels = None\n",
    "    # Split the sentences and labels into train/validation splits. Write your own code below\n",
    "    train_texts = None\n",
    "    validation_texts = None\n",
    "    train_labels = None\n",
    "    validation_labels = None\n",
    "    \n",
    "    # create the train and validation datasets from the splits\n",
    "    train_dataset = tf.data.Dataset.from_tensor_slices((None, None))\n",
    "    validation_dataset = tf.data.Dataset.from_tensor_slices((None, None))\n",
    "    \n",
    "\t### END CODE HERE ### \n",
    "    \n",
    "    return train_dataset, validation_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "circular-venue",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create the datasets\n",
    "train_dataset, validation_dataset = train_val_datasets(data)\n",
    "\n",
    "print(f\"There are {train_dataset.cardinality()} sentence-label pairs for training.\\n\")\n",
    "print(f\"There are {validation_dataset.cardinality()} sentence-label pairs for validation.\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "recovered-graph",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "There are 1780 sentence-label pairs for training.\n",
    "\n",
    "There are 445 sentence-label pairs for validation.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27ed81ba",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_train_val_datasets(train_val_datasets)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e7b32c6-98b1-4881-8bfb-ead17001c53b",
   "metadata": {},
   "source": [
    "## Vectorization - Sequences and padding\n",
    "\n",
    "With your training and validation data it is now time to perform the vectorization. However, first you need an important intermediate step which is to define a standardize function, which will be used to apply a transformation to every entry in your dataset in an attempt to standardize it. In this case you will use a function that removes [stopwords](https://en.wikipedia.org/wiki/Stop_word) from the texts in the dataset. This should improve the performance of your classifier by removing frequently used words that don't add information to determine the topic of the news. The function also removes any punctuation and makes all words lowercase. This function is already provided for you and can be found in the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b87dbce-06a2-43b0-b098-b23597101645",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "def standardize_func(sentence):\n",
    "    \"\"\"\n",
    "    Removes a list of stopwords\n",
    "    \n",
    "    Args:\n",
    "        sentence (tf.string): sentence to remove the stopwords from\n",
    "    \n",
    "    Returns:\n",
    "        sentence (tf.string): lowercase sentence without the stopwords\n",
    "    \"\"\"\n",
    "    # List of stopwords\n",
    "    stopwords = [\"a\", \"about\", \"above\", \"after\", \"again\", \"against\", \"all\", \"am\", \"an\", \"and\", \"any\", \"are\", \"as\", \"at\", \"be\", \"because\", \"been\", \"before\", \"being\", \"below\", \"between\", \"both\", \"but\", \"by\", \"could\", \"did\", \"do\", \"does\", \"doing\", \"down\", \"during\", \"each\", \"few\", \"for\", \"from\", \"further\", \"had\", \"has\", \"have\", \"having\", \"he\", \"her\", \"here\",  \"hers\", \"herself\", \"him\", \"himself\", \"his\", \"how\",  \"i\", \"if\", \"in\", \"into\", \"is\", \"it\", \"its\", \"itself\", \"let's\", \"me\", \"more\", \"most\", \"my\", \"myself\", \"nor\", \"of\", \"on\", \"once\", \"only\", \"or\", \"other\", \"ought\", \"our\", \"ours\", \"ourselves\", \"out\", \"over\", \"own\", \"same\", \"she\",  \"should\", \"so\", \"some\", \"such\", \"than\", \"that\",  \"the\", \"their\", \"theirs\", \"them\", \"themselves\", \"then\", \"there\", \"these\", \"they\", \"this\", \"those\", \"through\", \"to\", \"too\", \"under\", \"until\", \"up\", \"very\", \"was\", \"we\",  \"were\", \"what\",  \"when\", \"where\", \"which\", \"while\", \"who\", \"whom\", \"why\", \"why\", \"with\", \"would\", \"you\",  \"your\", \"yours\", \"yourself\", \"yourselves\", \"'m\",  \"'d\", \"'ll\", \"'re\", \"'ve\", \"'s\", \"'d\"]\n",
    " \n",
    "    # Sentence converted to lowercase-only\n",
    "    sentence = tf.strings.lower(sentence)\n",
    "    \n",
    "    # Remove stopwords\n",
    "    for word in stopwords:\n",
    "        if word[0] == \"'\":\n",
    "            sentence = tf.strings.regex_replace(sentence, rf\"{word}\\b\", \"\")\n",
    "        else:\n",
    "            sentence = tf.strings.regex_replace(sentence, rf\"\\b{word}\\b\", \"\")\n",
    "    \n",
    "    # Remove punctuation\n",
    "    sentence = tf.strings.regex_replace(sentence, r'[!\"#$%&()\\*\\+,-\\./:;<=>?@\\[\\\\\\]^_`{|}~\\']', \"\")\n",
    "\n",
    "\n",
    "    return sentence"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "79ca0c7b",
   "metadata": {},
   "source": [
    "Run the cell below to see this standardizing function in action. You can also try with your own sentences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77760bc6",
   "metadata": {
    "deletable": false
   },
   "outputs": [],
   "source": [
    "test_sentence = \"Hello! We're just about to see this function in action =)\"\n",
    "standardized_sentence = standardize_func(test_sentence)\n",
    "print(f\"Original sentence is:\\n{test_sentence}\\n\\nAfter standardizing:\\n{standardized_sentence}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0b035e2-20b5-4580-b128-770db49097f8",
   "metadata": {},
   "source": [
    "### Exercise 2: fit_vectorizer\n",
    "\n",
    "Next complete the `fit_vectorizer` function below. This function should return a [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer that has already been fitted on the training sentences. The vocabulary learned by the vectorizer should have `VOCAB_SIZE` size, and truncate the output sequences to have `MAX_LENGTH` length.  \n",
    "\n",
    "Remember to use the custom function `standardize_func` to standardize each sentence in the vectorizer. You can do this by passing the function to the `standardize` parameter of `TextVectorization`. You are encouraged to take a look into the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) to get a better understanding of how this works. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "recreational-prince",
   "metadata": {
    "deletable": false,
    "lines_to_next_cell": 2,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: fit_tvectorizer\n",
    "def fit_vectorizer(train_sentences, standardize_func):\n",
    "    '''\n",
    "    Defines and adapts the text vectorizer\n",
    "\n",
    "    Args:\n",
    "        train_sentences (tf.data.Dataset): sentences from the train dataset to fit the TextVectorization layer\n",
    "        standardize_func (FunctionType): function to remove stopwords and punctuation, and lowercase texts.\n",
    "    Returns:\n",
    "        TextVectorization: adapted instance of TextVectorization layer\n",
    "    '''\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    # Instantiate the Tokenizer class, passing in the correct values for num_words and oov_token\n",
    "    vectorizer = tf.keras.layers.TextVectorization( \n",
    "\t\tstandardize=None,\n",
    "\t\tmax_tokens=None,\n",
    "\t\toutput_sequence_length=None\n",
    "\t) \n",
    "    \n",
    "    # Fit the tokenizer to the training sentences\n",
    "    \n",
    "\t\n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return vectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "great-trading",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create the vectorizer\n",
    "text_only_dataset = train_dataset.map(lambda text, label: text)\n",
    "vectorizer = fit_vectorizer(text_only_dataset, standardize_func)\n",
    "vocab_size = vectorizer.vocabulary_size()\n",
    "\n",
    "print(f\"Vocabulary contains {vocab_size} words\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "pressing-recipe",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "Vocabulary contains 1000 words\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c139a2e",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_vectorizer(fit_vectorizer, standardize_func)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "familiar-reform",
   "metadata": {},
   "source": [
    "### Exercise 3: fit_label_encoder\n",
    "\n",
    "Remember your categories are also text labels, so you need to encode the labels as well. For this complete the `tokenize_labels` function below.\n",
    "\n",
    "A couple of things to note:\n",
    "- Use the function [`tf.keras.layers.StringLookup`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) to encode the labels. Use the correct parameters so that you don't include any OOV tokens.\n",
    "- You should fit the tokenizer to all the labels to avoid the case of a particular label not being present in the validation set. Since you are dealing with labels there should never be an OOV label. For this, you can concatenate the two datasets using the [`concatenate`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#concatenate) method from `tf.data.Dataset` objects.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "active-objective",
   "metadata": {
    "deletable": false,
    "id": "XkWiQ_FKZNp2",
    "lines_to_next_cell": 2,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: fit_label_encoder\n",
    "def fit_label_encoder(train_labels, validation_labels):\n",
    "    \"\"\"Creates an instance of a StringLookup, and trains it on all labels\n",
    "\n",
    "    Args:\n",
    "        train_labels (tf.data.Dataset): dataset of train labels\n",
    "        validation_labels (tf.data.Dataset): dataset of validation labels\n",
    "\n",
    "    Returns:\n",
    "        tf.keras.layers.StringLookup: adapted encoder for train and validation labels\n",
    "    \"\"\"\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    # join the two label datasets\n",
    "    labels = None #concatenate the two datasets.\n",
    "    \n",
    "    # Instantiate the StringLookup layer. Remember that you don't want any OOV tokens\n",
    "    label_encoder = None\n",
    "    \n",
    "    # Fit the TextVectorization layer on the train_labels\n",
    "    label_encoder.None\n",
    "   \n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return label_encoder"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "541096eb-ac6b-4a73-b787-5bf1158a5f13",
   "metadata": {},
   "source": [
    "Use your function to create a trained instance of the encoder, and print the obtained vocabulary to check that there are no OOV tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c6a7e5a-ea50-4663-8062-d076dcd5313f",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create the label encoder\n",
    "train_labels_only = train_dataset.map(lambda text, label: label)\n",
    "validation_labels_only = validation_dataset.map(lambda text, label: label)\n",
    "\n",
    "label_encoder = fit_label_encoder(train_labels_only,validation_labels_only)\n",
    "                                  \n",
    "print(f'Unique labels: {label_encoder.get_vocabulary()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28bd6b8d-2f46-4e85-9b6e-f2d532a038f4",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "Unique labels: ['sport', 'business', 'politics', 'tech', 'entertainment']\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a87c9db",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_label_encoder(fit_label_encoder)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sweet-sentence",
   "metadata": {},
   "source": [
    "### Exercise 4: preprocess_dataset\n",
    "\n",
    "Now that you have trained the vectorizer for the texts and the encoder for the labels, it's time for you to actually transform the dataset. For this complete the `preprocess_dataset` function below. \n",
    "Use this function to set the dataset batch size to 32\n",
    "\n",
    "Hint:\n",
    "- You can apply the preprocessing to each pair or text and label by using the [`.map`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map) method.\n",
    "- You can set the batchsize to any Dataset by using the [`.batch`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fourth-knight",
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: preprocess_dataset\n",
    "def preprocess_dataset(dataset, text_vectorizer, label_encoder):\n",
    "    \"\"\"Apply the preprocessing to a dataset\n",
    "\n",
    "    Args:\n",
    "        dataset (tf.data.Dataset): dataset to preprocess\n",
    "        text_vectorizer (tf.keras.layers.TextVectorization ): text vectorizer\n",
    "        label_encoder (tf.keras.layers.StringLookup): label encoder\n",
    "\n",
    "    Returns:\n",
    "        tf.data.Dataset: transformed dataset\n",
    "    \"\"\"\n",
    "    \n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    # Convert the Dataset sentences to sequences, and encode the text labels\n",
    "    dataset = None\n",
    "    dataset = dataset = None # Set a batchsize of 32\n",
    "    \n",
    "\t### END CODE HERE ###\n",
    "    \n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "separate-onion",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Preprocess your dataset\n",
    "train_proc_dataset = preprocess_dataset(train_dataset, vectorizer, label_encoder)\n",
    "validation_proc_dataset = preprocess_dataset(validation_dataset, vectorizer, label_encoder)\n",
    "\n",
    "print(f\"Number of batches in the train dataset: {train_proc_dataset.cardinality()}\")\n",
    "print(f\"Number of batches in the validation dataset: {validation_proc_dataset.cardinality()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sufficient-locator",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "Number of batches in the train dataset: 56\n",
    "Number of batches in the validation dataset: 14\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7975a5b2-2a09-4cdd-8eba-f8a54a3fcae3",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "train_batch = next(train_proc_dataset.as_numpy_iterator())\n",
    "validation_batch = next(validation_proc_dataset.as_numpy_iterator())\n",
    "\n",
    "print(f\"Shape of the train batch: {train_batch[0].shape}\")\n",
    "print(f\"Shape of the validation batch: {validation_batch[0].shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47f4e785-1513-4e31-8dec-1c3b39292a9b",
   "metadata": {},
   "source": [
    "Expected output:\n",
    "\n",
    "```\n",
    "Shape of the train batch: (32, 120)\n",
    "Shape of the validation batch: (32, 120)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6304976",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_preprocess_dataset(preprocess_dataset, vectorizer, label_encoder)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "devoted-helen",
   "metadata": {},
   "source": [
    "## Selecting the model for text classification\n",
    "### Exercise 5: create_model\n",
    "Now that the data is ready to be fed into a Neural Network it is time for you to define the model that will classify each text as being part of a certain category. \n",
    "\n",
    "For this complete the `create_model` below. \n",
    "\n",
    "A couple of things to keep in mind:\n",
    "\n",
    "- The last layer should be a Dense layer with 5 units (since there are 5 categories) with a softmax activation.\n",
    "\n",
    "\n",
    "- You should also compile your model using an appropriate loss function and optimizer.\n",
    "\n",
    "\n",
    "- You can use any architecture you want but keep in mind that this problem doesn't need many layers to be solved successfully. You don't need any layers beside Embedding, [GlobalAveragePooling1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) and Dense layers but feel free to try out different architectures.\n",
    "\n",
    "- **To pass this graded function your model should reach at least a 95% training accuracy and a 90% validation accuracy in under 30 epochs.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "little-bahrain",
   "metadata": {
    "deletable": false,
    "id": "HZ5um4MWZP-W",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: create_model\n",
    "def create_model():\n",
    "    \"\"\"\n",
    "    Creates a text classifier model\n",
    "    Returns:\n",
    "      tf.keras Model: the text classifier model\n",
    "    \"\"\"\n",
    "   \n",
    "    ### START CODE HERE ###\n",
    "\t\n",
    "    # Define your model\n",
    "    model = tf.keras.Sequential([ \n",
    "        tf.keras.Input(None),\n",
    "        None,\n",
    "    ])\n",
    "    \n",
    "    # Compile model. Set an appropriate loss, optimizer and metrics\n",
    "    model.compile(\n",
    "\t\tloss=None,\n",
    "\t\toptimizer=None,\n",
    "\t\tmetrics=['accuracy'] \n",
    "\t) \n",
    "\n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a04c90e4",
   "metadata": {},
   "source": [
    "The next cell allows you to check the number of total and trainable parameters of your model and prompts a warning in case these exceeds those of a reference solution, this serves the following 3 purposes listed in order of priority:\n",
    "\n",
    "- Helps you prevent crashing the kernel during training.\n",
    "\n",
    "- Helps you avoid longer-than-necessary training times.\n",
    "- Provides a reasonable estimate of the size of your model. In general you will usually prefer smaller models given that they accomplish their goal successfully.\n",
    "\n",
    "\n",
    "**Notice that this is just informative** and may be very well below the actual limit for size of the model necessary to crash the kernel. So even if you exceed this reference you are probably fine. However, **if the kernel crashes during training or it is taking a very long time and your model is larger than the reference, come back here and try to get the number of parameters closer to the reference.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "resident-productivity",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the untrained model\n",
    "model = create_model()\n",
    "\n",
    "# Check the parameter count against a reference solution\n",
    "unittests.parameter_count(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e0814ce",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "example_batch = train_proc_dataset.take(1)\n",
    "\n",
    "try:\n",
    "\tmodel.evaluate(example_batch, verbose=False)\n",
    "except:\n",
    "\tprint(\"Your model is not compatible with the dataset you defined earlier. Check that the loss function and last layer are compatible with one another.\")\n",
    "else:\n",
    "\tpredictions = model.predict(example_batch, verbose=False)\n",
    "\tprint(f\"predictions have shape: {predictions.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d1d634b",
   "metadata": {},
   "source": [
    "**Expected output:**\n",
    "```\n",
    "predictions have shape: (32, 5)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfa474c9",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_create_model(create_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "498bf653",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "history = model.fit(train_proc_dataset, epochs=30, validation_data=validation_proc_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "three-pension",
   "metadata": {},
   "source": [
    "Once training has finished you can run the following cell to check the training and validation accuracy achieved at the end of each epoch.\n",
    "\n",
    "**Remember that to pass this assignment your model should achieve a training accuracy of at least 95% and a validation accuracy of at least 90%. If your model didn't achieve these thresholds, try training again with a different model architecture.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rural-sheffield",
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": []
   },
   "outputs": [],
   "source": [
    "def plot_graphs(history, metric):\n",
    "    plt.plot(history.history[metric])\n",
    "    plt.plot(history.history[f'val_{metric}'])\n",
    "    plt.xlabel(\"Epochs\")\n",
    "    plt.ylabel(metric)\n",
    "    plt.legend([metric, f'val_{metric}'])\n",
    "    plt.show()\n",
    "    \n",
    "plot_graphs(history, \"accuracy\")\n",
    "plot_graphs(history, \"loss\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "material-breast",
   "metadata": {},
   "source": [
    "If your model passes the previously mentioned thresholds, and you are happy with the results, be sure to save your notebook and submit it for grading. Also run the cell below to save the history of the model. This is needed for grading purposes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2fab48f5",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "with open('history.pkl', 'wb') as f:\n",
    "    pickle.dump(history.history, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "primary-tennessee",
   "metadata": {},
   "source": [
    "## Optional Exercise - Visualizing 3D Vectors\n",
    "\n",
    "As you saw on the lecture you can visualize the vectors associated with each word in the training set in a 3D space.\n",
    "\n",
    "For this run the following cell, which will create the `metadata.tsv` and `weights.tsv` files. These are the ones you are going to upload to[Tensorflow's Embedding Projector](https://projector.tensorflow.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "awful-geneva",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "OhnFA_TDXrih",
    "tags": []
   },
   "outputs": [],
   "source": [
    "embedding = model.layers[0]\n",
    "\n",
    "with open('./metadata.tsv', \"w\") as f:\n",
    "    for word in vectorizer.get_vocabulary():\n",
    "        f.write(\"{}\\n\".format(word))\n",
    "weights = tf.Variable(embedding.get_weights()[0][1:])\n",
    "\n",
    "with open('./weights.tsv', 'w') as f: \n",
    "    for w in weights:\n",
    "        f.write('\\t'.join([str(x) for x in w.numpy()]) + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "further-sunset",
   "metadata": {},
   "source": [
    "By running the previous cell, these files are placed within your filesystem. To download them, right click on the file, which you will see on the left sidebar, and select the `Download` option. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sudden-investigator",
   "metadata": {},
   "source": [
    "**Congratulations on finishing this week's assignment!**\n",
    "\n",
    "You have successfully implemented a neural network capable of classifying text and also learned about embeddings and tokenization along the way!\n",
    "\n",
    "**Keep it up!**"
   ]
  }
 ],
 "metadata": {
  "dlai_version": "1.2.0",
  "grader_version": "1",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
